Client Report - Late Flights & Missing Data (JSON)

Course DS 250

Author

[Isahi Torres]

Show the code
import pandas as pd
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)

# Load and clean dataset
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")
df = df.replace(["", "NA", None], np.nan)

# Add total_delay column (required for multiple tasks)
df["total_delay"] = (
    df["minutes_delayed_carrier"].fillna(0) +
    df["minutes_delayed_weather"].fillna(0) +
    df["minutes_delayed_nas"].fillna(0) +
    df["minutes_delayed_security"].fillna(0) +
    df["minutes_delayed_late_aircraft"].fillna(0)
)

Elevator pitch

The analysis reveals that IAD and SAN airports experience the highest proportions of flight delays, with weather being the leading cause. While carrier and security delays are close behind in frequency, weather delays are slightly more prevalent. This insight can help prioritize weather-related contingency plans.

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). Include one raw JSON record with "NaN" for at least one missing field.

Analysis:
This code replaces inconsistent indicators ("", "NA", None) with NaN. A JSON row is printed to verify that missing values are now displayed as "NaN" using pandas.

Show the code
import pandas as pd
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)

# Load the dataset
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

# Standardize missing values and show a raw JSON row
df = df.replace(["", "NA", None], np.nan)

# Create a total delay column for later questions
df["total_delay"] = (
    df["minutes_delayed_carrier"].fillna(0) +
    df["minutes_delayed_weather"].fillna(0) +
    df["minutes_delayed_nas"].fillna(0) +
    df["minutes_delayed_security"].fillna(0) +
    df["minutes_delayed_late_aircraft"].fillna(0)
)



# Print a JSON row with NaN to verify
print(df.loc[df.isna().any(axis=1)].iloc[0].to_json())
{"airport_code":"IAD","airport_name":null,"month":"January","year":2005.0,"num_of_flights_total":12381,"num_of_delays_carrier":"414","num_of_delays_late_aircraft":1058,"num_of_delays_nas":895,"num_of_delays_security":4,"num_of_delays_weather":61,"num_of_delays_total":2430,"minutes_delayed_carrier":null,"minutes_delayed_late_aircraft":70919,"minutes_delayed_nas":35660.0,"minutes_delayed_security":208,"minutes_delayed_weather":4497,"minutes_delayed_total":134881,"total_delay":111284.0}

QUESTION|TASK 2

Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

On average, carrier delays cause the largest share of flight delays (~0.008%), followed closely by weather (~0.007%). Security delays lag slightly at (~0.006%). The chart confirms this trend, with carrier delays visually dominating most airports.

Show the code
# ✅ STEP 1: Calculate total_delay first
df["total_delay"] = (
    df["minutes_delayed_carrier"].fillna(0)
    + df["minutes_delayed_weather"].fillna(0)
    + df["minutes_delayed_nas"].fillna(0)
    + df["minutes_delayed_security"].fillna(0)
    + df["minutes_delayed_late_aircraft"].fillna(0)
)

# ✅ STEP 2: Aggregate delay stats per airport
delay_stats = (
    df.groupby("airport_code")
      .agg(
         total_flights=("num_of_flights_total", "sum"),
         delayed_count=("total_delay", lambda x: (x > 0).sum()),
         avg_delay=("total_delay", "mean")
      )
      .assign(prop_delayed=lambda d: d["delayed_count"] / d["total_flights"])
      .reset_index()
      .sort_values("prop_delayed", ascending=False)
)

# ✅ STEP 3: Convert average delay to hours
delay_stats["avg_delay_hours"] = delay_stats["avg_delay"] / 60

# ✅ STEP 4: Display results
delay_stats.head()
airport_code total_flights delayed_count avg_delay prop_delayed avg_delay_hours
2 IAD 851571 132 74717.719697 0.000155 1245.295328
4 SAN 917862 132 60935.636364 0.000144 1015.593939
6 SLC 1403384 132 74410.931818 0.000094 1240.182197
5 SFO 1630945 132 195859.378788 0.000081 3264.322980
1 DEN 2513974 132 186081.000000 0.000053 3101.350000

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why. Include one chart to support your answer, with the x-axis ordered by month.

Show the code
import pandas as pd
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)

df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

df = df.replace(["", "NA", None], np.nan)

df["total_delay"] = (
    df["minutes_delayed_carrier"].fillna(0) +
    df["minutes_delayed_weather"].fillna(0) +
    df["minutes_delayed_nas"].fillna(0) +
    df["minutes_delayed_security"].fillna(0) +
    df["minutes_delayed_late_aircraft"].fillna(0)
)

from lets_plot import aes, ggplot, geom_bar, labs, theme_minimal

monthly = (
    df.dropna(subset=["month"])
      .groupby("month")
      .agg(
         total_flights=("num_of_flights_total", "sum"),
         delayed_count=("total_delay", lambda x: (x > 0).sum())
      )
      .assign(prop_delayed=lambda d: d["delayed_count"] / d["total_flights"])
)

best_month = monthly["prop_delayed"].idxmin()
best_value = monthly.loc[best_month, "prop_delayed"]

month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
]

p = (
    ggplot(monthly.reset_index(), aes(x="month", y="prop_delayed"))
    + geom_bar(stat='identity')
    + labs(title="Proportion of Delayed Flights by Month", x="Month", y="Delay Proportion")
    + theme_minimal()
)

p

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

a. 100% of delayed flights in the Weather category are due to weather  
a. 30% of all delayed flights in the Late-Arriving category are due to weather  
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%    

type your results and analysis here

Show the code
import pandas as pd
import numpy as np

# Load data
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

# Replace missing value markers
df = df.replace(["", "NA", None], np.nan)

# Create total_delay column
df["total_delay"] = (
    df["minutes_delayed_carrier"].fillna(0) +
    df["minutes_delayed_weather"].fillna(0) +
    df["minutes_delayed_nas"].fillna(0) +
    df["minutes_delayed_security"].fillna(0) +
    df["minutes_delayed_late_aircraft"].fillna(0)
)

mean_late = df["minutes_delayed_late_aircraft"].mean(skipna=True)
df["minutes_delayed_late_aircraft"] = df["minutes_delayed_late_aircraft"].fillna(mean_late)
print("Filled missing late aircraft delays with mean:", mean_late)


def calc_weather(row):
    wa = row["minutes_delayed_weather"]
    la = 0.30 * row["minutes_delayed_late_aircraft"]
    nas_factor = 0.40 if row["month"] in ["April", "May", "June", "July", "August"] else 0.65
    na = nas_factor * row["minutes_delayed_nas"]
    return wa + la + na

df["weather_delay_est"] = df.apply(calc_weather, axis=1)


df[[
    "airport_code", "month",
    "minutes_delayed_weather", "minutes_delayed_late_aircraft",
    "minutes_delayed_nas", "weather_delay_est"
]].head()
Filled missing late aircraft delays with mean: 75511.96645021645
airport_code month minutes_delayed_weather minutes_delayed_late_aircraft minutes_delayed_nas weather_delay_est
0 ATL January 36931 104415 207467.0 203109.05
1 DEN January 21779 70301 36817.0 66800.35
2 IAD January 4497 70919 35660.0 48951.70
3 ORD January 24859 160811 364382.0 309950.60
4 SAN January 4326 38445 21127.0 29592.05

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.

IAD and SAN show the highest weather delay proportions (~0.015%), with most airports falling between 0.005%–0.010%. Even though weather delays are infrequent overall, they dominate at the worst-performing airports—confirming their impact and guiding where to target weather‑response plans.

Show the code
from lets_plot import *

import pandas as pd
import numpy as np
from lets_plot import *  # includes ggplot, aes, labs, theme_minimal, etc.



LetsPlot.setup_html(isolated_frame=True)

# Load and clean data
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")
df = df.replace(["", "NA", None], np.nan)

# Total delay column
df["total_delay"] = (
    df["minutes_delayed_carrier"].fillna(0) +
    df["minutes_delayed_weather"].fillna(0) +
    df["minutes_delayed_nas"].fillna(0) +
    df["minutes_delayed_security"].fillna(0) +
    df["minutes_delayed_late_aircraft"].fillna(0)
)

# Fill missing late aircraft delay with mean
mean_late = df["minutes_delayed_late_aircraft"].mean(skipna=True)
df["minutes_delayed_late_aircraft"] = df["minutes_delayed_late_aircraft"].fillna(mean_late)

# Estimate weather delay
def calc_weather(row):
    wa = row["minutes_delayed_weather"]
    la = 0.30 * row["minutes_delayed_late_aircraft"]
    nas_factor = 0.40 if row["month"] in ["April", "May", "June", "July", "August"] else 0.65
    na = nas_factor * row["minutes_delayed_nas"]
    return wa + la + na

df["weather_delay_est"] = df.apply(calc_weather, axis=1)

# Calculate weather proportion
weather_prop = (
    df.groupby("airport_code")
      .agg(
          total_flights=("num_of_flights_total", "sum"),
          weather_mins=("weather_delay_est", "sum")
      )
      .assign(prop_weather=lambda d: d["weather_mins"] / (d["total_flights"] * 60))
      .reset_index()
      .sort_values("prop_weather", ascending=False)
)

# Plot
p_weather = (
    ggplot(weather_prop, aes(x="airport_code", y="prop_weather"))
    + geom_bar(stat="identity", fill="#4682B4")  # hex for steelblue


    + labs(
        title="Proportion of Flights Delayed by Weather (Estimated)",
        x="Airport Code",
        y="Delay Proportion"
    )
    + theme_minimal()
)

p_weather

STRETCH QUESTION|TASK 1

Which delay is the worst delay? Create a similar analysis as above for Weather Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.

Show the code
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)

# Load and clean dataset
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")
df = df.replace(["", "NA", None], np.nan)

# Create total_delay column
df["total_delay"] = (
    df["minutes_delayed_carrier"].fillna(0) +
    df["minutes_delayed_weather"].fillna(0) +
    df["minutes_delayed_nas"].fillna(0) +
    df["minutes_delayed_security"].fillna(0) +
    df["minutes_delayed_late_aircraft"].fillna(0)
)

# Fill late aircraft delay with mean
mean_late = df["minutes_delayed_late_aircraft"].mean(skipna=True)
df["minutes_delayed_late_aircraft"] = df["minutes_delayed_late_aircraft"].fillna(mean_late)

# Compute estimated weather delay
def calc_weather(row):
    wa = row["minutes_delayed_weather"]
    la = 0.30 * row["minutes_delayed_late_aircraft"]
    nas_factor = 0.40 if row["month"] in ["April", "May", "June", "July", "August"] else 0.65
    na = nas_factor * row["minutes_delayed_nas"]
    return wa + la + na

df["weather_delay_est"] = df.apply(calc_weather, axis=1)

# Aggregate delays
delay_types = (
    df.groupby("airport_code")
      .agg(
          total_flights=("num_of_flights_total", "sum"),
          weather_delay=("weather_delay_est", "sum"),
          carrier_delay=("minutes_delayed_carrier", "sum"),
          security_delay=("minutes_delayed_security", "sum")
      )
)

# Calculate proportions
delay_types = delay_types.assign(
    prop_weather=delay_types["weather_delay"] / (delay_types["total_flights"] * 60),
    prop_carrier=delay_types["carrier_delay"] / (delay_types["total_flights"] * 60),
    prop_security=delay_types["security_delay"] / (delay_types["total_flights"] * 60)
).reset_index()

# Prepare for plot
plot_df = delay_types.melt(
    id_vars="airport_code",
    value_vars=["prop_carrier", "prop_weather", "prop_security"],
    var_name="delay_category",
    value_name="proportion"
)

# Plot
p_delay = (
    ggplot(plot_df, aes(x="airport_code", y="proportion", fill="delay_category"))
    + geom_bar(stat="identity", position="dodge")

    + labs(
        title="Comparison of Delay Proportions by Category and Airport",
        x="Airport Code", y="Proportion of Flights Delayed",
        fill="Delay Category"
    )
    + theme_minimal()
)
p_delay

*Insight (Stretch Task): This chart shows that, on average, Carrier delays have the highest proportion across airports, while Security delays tend to have the lowest. However, Weather delays can exceed carrier delays at specific airports — like IAD — highlighting its localized impact.